16. Quality: Programmatic Assessment 2

Quality: Programmatic Assessment 2

Quiz

Using the results of the programmatic assessment in the Jupyter Notebook below, identify the results that are indicative of data quality issues in the following quizzes.

Quality: Programmatic Assessment

Which of the following parts of the programmatic assessment in the Jupyter Notebook below are indicative of data quality issues? (Hint: Make sure to look for variations of the same name.)

SOLUTION:
  • Value count for the *surname* 'Doe' is 6
  • 'Jake Jakobsen' is a duplicated name
  • Lowest weight is 48.8 lbs
  • No null entries are returned from `sum` and `isnull` on the *auralin* and *novodra* columns
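The checks behind these answers can be sketched with pandas. This is only a minimal illustration, not the notebook's actual code: the two small tables, the `given_name` column, and every value in them are invented, and the `-` placeholder for a missing dose is an assumption made here to show why a null count of zero can still hide missing data.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's tables; every value is invented.
patients = pd.DataFrame({
    "given_name": ["John", "John", "Jake", "Jake", "Mia"],
    "surname": ["Doe", "Doe", "Jakobsen", "Jakobsen", "Lund"],
    "weight": [180.0, 180.0, 48.8, 155.3, 150.2],
})
treatments = pd.DataFrame({
    "auralin": ["41u - 48u", "-", "33u - 36u"],
    "novodra": ["-", "40u - 45u", "-"],
})

# Count how often each surname occurs; a high count for a
# placeholder-looking name such as 'Doe' deserves a closer look.
surname_counts = patients["surname"].value_counts()

# Collect full names that occur more than once.
full_name = patients["given_name"] + " " + patients["surname"]
repeated_names = full_name[full_name.duplicated(keep=False)].unique()

# Sort weights ascending so the smallest values surface first.
lightest = patients["weight"].sort_values().head(3)

# Count nulls per column; zero nulls does not prove the data is complete.
null_counts = treatments[["auralin", "novodra"]].isnull().sum()

# Assumed placeholder: count cells that hold '-' instead of a dose.
placeholder_counts = (treatments[["auralin", "novodra"]] == "-").sum()

print(surname_counts)
print(repeated_names)
print(lightest)
print(null_counts)
print(placeholder_counts)
```

In this toy data `isnull` reports no nulls in either column, yet the placeholder count shows cells that carry no dose, which is the kind of gap a null check alone would miss.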

Solution

Quality: Programmatic Assessment 2 Solution

*Note: The default John Doe data is a validity issue, as described in the video, but it is also a completeness issue because the default data displaced real patient data that is no longer in the *patients* table. Because completeness is more "severe" than validity, completeness is likely the more appropriate data quality dimension. The distinction matters because missing data is usually best addressed first when cleaning data, as you'll experience in Lesson 4. However, let's assume that the overwritten data can't be recovered, which makes treating it as a validity issue acceptable.*

'Elizabeth Knudsen' being a duplicated name isn't a data quality issue because 'Elizabeth Knudsen' is not a duplicated name. Her demographic information, which is filled with NaN entries, is duplicated, though, since there are other patient records with missing address, city, state, etc. information.
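A small sketch can show how a record gets flagged as a duplicate without its name being duplicated. The table below is invented for illustration, including the `given_name`, `address`, and `city` column names. It relies on the fact that pandas' `duplicated` treats NaN values in the same column as equal to each other.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records; two of them lack address information.
patients = pd.DataFrame({
    "given_name": ["Elizabeth", "Mia", "Jake"],
    "surname": ["Knudsen", "Lund", "Jakobsen"],
    "address": [np.nan, np.nan, "12 Hypothetical St"],
    "city": [np.nan, np.nan, "Springfield"],
})

# Duplicates judged by name only: none of the three names repeats.
name_dupes = patients.duplicated(subset=["given_name", "surname"], keep=False)

# Duplicates judged by demographic columns: the two all-NaN rows match
# each other because duplicated() considers NaN equal to NaN.
address_dupes = patients.duplicated(subset=["address", "city"], keep=False)

print(patients[name_dupes])
print(patients[address_dupes])
```

Here the first check returns no rows, while the second returns both records with missing addresses. The duplicate flag therefore points to missing data rather than to a repeated patient.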

The indexes of the series returned by `sort_values` on the *weight* column of the *patients* table are supposed to be out of order, since the original dataset isn't sorted by weight.
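The index behavior is easy to confirm on a toy series. The weights below are invented. `sort_values` reorders the values but keeps each value's original index label, so the labels appear shuffled.

```python
import pandas as pd

# Hypothetical weights in their original (unsorted) row order.
weights = pd.Series([180.0, 48.8, 150.2, 121.7], name="weight")

# Sorting moves each value together with its original index label.
sorted_weights = weights.sort_values()
print(sorted_weights.index.tolist())

# If sequential labels are wanted, the index can be rebuilt afterwards.
renumbered = sorted_weights.reset_index(drop=True)
print(renumbered.index.tolist())
```

The shuffled labels are useful in practice, because they identify which rows of the original table hold the extreme values.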